Group G
- Dev Makwana (885064)
- Joel Crasto (883863)
- Krina Patel (886861)
- Mahaveersinh Chauhan (884854)
- Trushna Patel (886910)
The fundamental goal of this project is to develop an accurate and effective regression model for price prediction of Walmart groceries. Moreover, insigths are revealed with Exploratory Data Analysis.
What is the average price of products in Walmart Grocery?
What is the average product size (in terms of weight) in Walmart Grocery?
What are the most popular brands in Walmart Grocery?
What are the most popular products in Walmart Grocery?
Predict the price of the products?
Check department-wise sales at stores and for different stores of Walmart?
| Field Name | Description | Example |
|---|---|---|
| SHIPPING_LOCATION | The location where the product is shipped from. | 79936 |
| DEPARTMENT | The department in which the product is categorized. | Deli |
| CATEGORY | The category in which the product is categorized. | Hummus, Dips, & Salsa |
| SUBCATEGORY | The subcategory in which the product is categorized. | White Wine |
| BREADCRUMBS | The breadcrumbs for the product. | Deli/Hummus, Dips, & Salsa |
| SKU | The SKU for the product. | 110895339 |
| PRODUCT_URL | The URL for the product. | https://www.walmart.com/ip/Marketside-Roasted-Red-Pepper-Hummus-10-Oz/110895339?fulfillmentIntent=Pi... |
| PRODUCT_NAME | The name of the product. | Marketside Roasted Red Pepper Hummus, 10 Oz |
| BRAND | The brand of the product. | Marketside |
| PRICE_RETAIL | The retail price of the product. | 2.67 |
| PRICE_CURRENT | The current price of the product. | 2.67 |
| PRODUCT_SIZE | The size of the product. | 10 |
| PROMOTION | The promotion for the product. | NULL |
| RunDate | The date on which the data was collected. | 2022-09-11 21:20:04 |
| Tid | Transaction ID | 16163804 |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly
import plotly.io as pio
from sklearn.decomposition import IncrementalPCA
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV
from sklearn.model_selection import GridSearchCV, KFold
import math
dataset = pd.read_csv("WMT_Grocery_202209.csv", index_col=0)
c:\users\chauh\appdata\local\programs\python\python37\lib\site-packages\IPython\core\interactiveshell.py:3553: DtypeWarning: Columns (4) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
dataset.head()
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | SUBCATEGORY | BREADCRUMBS | SKU | PRODUCT_URL | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | PROMOTION | RunDate | tid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||||||||
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 110895339 | https://www.walmart.com/ip/Marketside-Roasted-... | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163804 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 105455228 | https://www.walmart.com/ip/Marketside-Roasted-... | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163805 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 128642379 | https://www.walmart.com/ip/Marketside-Classic-... | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163806 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 366126367 | https://www.walmart.com/ip/Marketside-Everythi... | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163807 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 160090316 | https://www.walmart.com/ip/Price-s-Jalapeno-Di... | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12 | NaN | 2022-09-11 21:20:04 | 16163808 |
dataset.describe()
| SHIPPING_LOCATION | SKU | PRICE_RETAIL | PRICE_CURRENT | PROMOTION | tid | |
|---|---|---|---|---|---|---|
| count | 568534.000000 | 5.685340e+05 | 568534.000000 | 568534.000000 | 0.0 | 5.685340e+05 |
| mean | 57713.149935 | 2.453284e+08 | 5.105377 | 5.079274 | NaN | 1.644807e+07 |
| std | 24113.969156 | 3.049375e+08 | 4.824906 | 4.727971 | NaN | 1.641218e+05 |
| min | 6010.000000 | 8.795550e+05 | 0.160000 | 0.160000 | NaN | 1.616380e+07 |
| 25% | 33647.000000 | 1.555602e+07 | 2.480000 | 2.480000 | NaN | 1.630594e+07 |
| 50% | 63376.000000 | 5.003978e+07 | 3.880000 | 3.880000 | NaN | 1.644807e+07 |
| 75% | 77449.000000 | 4.448820e+08 | 5.980000 | 5.980000 | NaN | 1.659020e+07 |
| max | 96797.000000 | 1.996767e+09 | 246.870000 | 246.870000 | NaN | 1.673234e+07 |
dataset.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 568534 entries, 0 to 568533 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SHIPPING_LOCATION 568534 non-null int64 1 DEPARTMENT 568534 non-null object 2 CATEGORY 568534 non-null object 3 SUBCATEGORY 361324 non-null object 4 BREADCRUMBS 568534 non-null object 5 SKU 568534 non-null int64 6 PRODUCT_URL 568534 non-null object 7 PRODUCT_NAME 568534 non-null object 8 BRAND 568507 non-null object 9 PRICE_RETAIL 568534 non-null float64 10 PRICE_CURRENT 568534 non-null float64 11 PRODUCT_SIZE 505709 non-null object 12 PROMOTION 0 non-null float64 13 RunDate 568534 non-null object 14 tid 568534 non-null int64 dtypes: float64(3), int64(3), object(9) memory usage: 69.4+ MB
dataset.isnull().sum()
SHIPPING_LOCATION 0 DEPARTMENT 0 CATEGORY 0 SUBCATEGORY 207210 BREADCRUMBS 0 SKU 0 PRODUCT_URL 0 PRODUCT_NAME 0 BRAND 27 PRICE_RETAIL 0 PRICE_CURRENT 0 PRODUCT_SIZE 62825 PROMOTION 568534 RunDate 0 tid 0 dtype: int64
dataset
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | SUBCATEGORY | BREADCRUMBS | SKU | PRODUCT_URL | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | PROMOTION | RunDate | tid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||||||||
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 110895339 | https://www.walmart.com/ip/Marketside-Roasted-... | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163804 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 105455228 | https://www.walmart.com/ip/Marketside-Roasted-... | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163805 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 128642379 | https://www.walmart.com/ip/Marketside-Classic-... | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163806 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 366126367 | https://www.walmart.com/ip/Marketside-Everythi... | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10 | NaN | 2022-09-11 21:20:04 | 16163807 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | NaN | Deli/Hummus, Dips, & Salsa | 160090316 | https://www.walmart.com/ip/Price-s-Jalapeno-Di... | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12 | NaN | 2022-09-11 21:20:04 | 16163808 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 568529 | 70072 | Alcohol | Wine | White Wine | Alcohol/Wine | 593600139 | https://www.walmart.com/ip/Farm-Fresh-Blueberr... | Farm Fresh Blueberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 | NaN | 2022-09-11 21:20:04 | 16732333 |
| 568530 | 70072 | Alcohol | Wine | White Wine | Alcohol/Wine | 333403243 | https://www.walmart.com/ip/Farm-Fresh-Peach-Mo... | Farm Fresh Peach Moscato 750 Ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 | NaN | 2022-09-11 21:20:04 | 16732334 |
| 568531 | 70072 | Alcohol | Wine | White Wine | Alcohol/Wine | 526588325 | https://www.walmart.com/ip/Farm-Fresh-Raspberr... | Farm Fresh Raspberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 | NaN | 2022-09-11 21:20:04 | 16732335 |
| 568532 | 70072 | Alcohol | Wine | White Wine | Alcohol/Wine | 286992782 | https://www.walmart.com/ip/Farm-Fresh-Mango-Mo... | Farm Fresh Mango Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 | NaN | 2022-09-11 21:20:04 | 16732336 |
| 568533 | 70072 | Alcohol | Wine | White Wine | Alcohol/Wine | 160015930 | https://www.walmart.com/ip/Ole-Orleans-Heritag... | Ole Orleans Heritage Riesling 750ml | Ole Orleans | 18.98 | 18.98 | 750 | NaN | 2022-09-11 21:20:04 | 16732337 |
568534 rows × 15 columns
cleaned_dataset = dataset.copy()
dropping_columns = ["RunDate", "SKU", "PROMOTION", "tid", "PRODUCT_URL", "SUBCATEGORY"]
cleaned_dataset.drop(dropping_columns, axis=1, inplace=True)
cleaned_dataset
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | BREADCRUMBS | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | |
|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 568529 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Blueberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 |
| 568530 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Peach Moscato 750 Ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 |
| 568531 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Raspberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 |
| 568532 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Mango Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750 |
| 568533 | 70072 | Alcohol | Wine | Alcohol/Wine | Ole Orleans Heritage Riesling 750ml | Ole Orleans | 18.98 | 18.98 | 750 |
568534 rows × 9 columns
cleaned_dataset["PRODUCT_SIZE"] = pd.to_numeric(cleaned_dataset["PRODUCT_SIZE"],errors='coerce')
cleaned_dataset
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | BREADCRUMBS | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | |
|---|---|---|---|---|---|---|---|---|---|
| index | |||||||||
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10.0 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 568529 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Blueberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 568530 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Peach Moscato 750 Ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 568531 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Raspberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 568532 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Mango Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 568533 | 70072 | Alcohol | Wine | Alcohol/Wine | Ole Orleans Heritage Riesling 750ml | Ole Orleans | 18.98 | 18.98 | 750.0 |
568534 rows × 9 columns
cleaned_dataset.isnull().sum()
SHIPPING_LOCATION 0 DEPARTMENT 0 CATEGORY 0 BREADCRUMBS 0 PRODUCT_NAME 0 BRAND 27 PRICE_RETAIL 0 PRICE_CURRENT 0 PRODUCT_SIZE 63560 dtype: int64
cleaned_dataset.dropna(inplace=True)
cleaned_dataset.isnull().sum()
SHIPPING_LOCATION 0 DEPARTMENT 0 CATEGORY 0 BREADCRUMBS 0 PRODUCT_NAME 0 BRAND 0 PRICE_RETAIL 0 PRICE_CURRENT 0 PRODUCT_SIZE 0 dtype: int64
cleaned_dataset.reset_index(drop=True, inplace=True)
cleaned_dataset
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | BREADCRUMBS | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10.0 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Blueberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504970 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Peach Moscato 750 Ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504971 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Raspberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504972 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Mango Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504973 | 70072 | Alcohol | Wine | Alcohol/Wine | Ole Orleans Heritage Riesling 750ml | Ole Orleans | 18.98 | 18.98 | 750.0 |
504974 rows × 9 columns
eda_df = cleaned_dataset.copy()
eda_df
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | BREADCRUMBS | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10.0 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Blueberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504970 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Peach Moscato 750 Ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504971 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Raspberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504972 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Mango Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504973 | 70072 | Alcohol | Wine | Alcohol/Wine | Ole Orleans Heritage Riesling 750ml | Ole Orleans | 18.98 | 18.98 | 750.0 |
504974 rows × 9 columns
corr_train = eda_df.corr()
fig = px.imshow(corr_train, text_auto=True, aspect="auto", color_continuous_scale='RdBu_r')
fig.update_layout(title_text='Correlation Matrix', title_x=0.5, width=950, height=800)
fig.show()
avg_price = eda_df.groupby(["DEPARTMENT"]).agg({"PRICE_RETAIL" : "mean"}).reset_index()
fig = px.bar(avg_price, x='DEPARTMENT', y='PRICE_RETAIL',text_auto='.2s', labels={"DEPARTMENT" : "Department", "PRICE_RETAIL" : "Price"})
fig.update_layout(title_text='Average Price By Department', title_x=0.5, width=1000, height=650)
fig.show()
avg_size = eda_df.groupby(["DEPARTMENT"]).agg({"PRODUCT_SIZE" : "mean"}).reset_index()
fig = px.bar(avg_size, x='DEPARTMENT', y='PRODUCT_SIZE',text_auto='.2s', labels={"DEPARTMENT" : "Department", "PRODUCT_SIZE" : "Product Size"})
fig.update_layout(title_text='Average Product Size By Department', title_x=0.5, width=1000, height=650)
fig.show()
popular_brand = pd.DataFrame(eda_df["BRAND"].value_counts()).reset_index().sort_values(by=["BRAND"], ascending=False)
fig = px.pie(popular_brand[:10], values='BRAND', names='index')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title_text='Top 10 Popular Brands', title_x=0.45)
fig.show()
popular_products = pd.DataFrame(eda_df["PRODUCT_NAME"].value_counts()).reset_index().sort_values(by=["PRODUCT_NAME"], ascending=False)
fig = px.bar(popular_products[:20], x='index', y='PRODUCT_NAME',text_auto='.2s', labels={"PRODUCT_NAME" : "Product Count", "index" : "Product Name"})
fig.update_layout(title_text='Top 20 Popular Products', title_x=0.5, width=1000, height=650)
fig.update_xaxes(tickangle = 45,automargin = False)
fig.show()
Following stacked bar graph represents the total of all prices at different shipping locations department wise. It is observed that Baking and Breakfast and Cereal Department shows the highest prices of products while Fresh Produce and Meat and Seafood shows the lowest product prices.
price_by_shipping_location = eda_df.groupby(["SHIPPING_LOCATION", "DEPARTMENT"]).agg({"PRICE_RETAIL" : "sum"}).reset_index()
price_by_shipping_location["SHIPPING_LOCATION"] = price_by_shipping_location["SHIPPING_LOCATION"].apply(lambda x : str(x))
fig = px.bar(price_by_shipping_location, x='SHIPPING_LOCATION', y='PRICE_RETAIL', color="DEPARTMENT", labels={"SHIPPING_LOCATION" : "Store Locations", "PRICE_RETAIL" : "Price", "DEPARTMENT" : "Department"})
fig.update_layout(title_text='Total Price by Store Location corresponding to different Departments', title_x=0.5, width=1000, height=650)
fig.show()
The pie chart below shows the percentage count of number of products with respect to various departments. Pantry and Breakfast and Cereals department has the highest products while Fresh Produce and Alcohol had lesser products.
count_by_department = eda_df.groupby("DEPARTMENT").agg({'PRODUCT_NAME':'count'}).reset_index()
fig = px.pie(count_by_department, values='PRODUCT_NAME', names='DEPARTMENT', hole=.5, color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(annotations=[dict(text='Department', x=0.5, y=0.5, font_size=30, showarrow=False)])
fig.update_layout(title_text='Percentage of Product counts by Department', title_x=0.45)
fig.show()
The Meat and Seafood section observed the highest sale of around 200k followed by Breakfast and Cereals department which makes around 140k.
top_20_category = eda_df.groupby(["CATEGORY", "DEPARTMENT"]).agg({"PRICE_RETAIL" : "sum"}).reset_index()
top_20_category = top_20_category.sort_values(by=["PRICE_RETAIL"], ascending=False)
fig = px.bar(top_20_category[:20], x='CATEGORY', y='PRICE_RETAIL', color="DEPARTMENT", text_auto='.2s', labels={"CATEGORY" : "Category", "PRICE_RETAIL" : "Price", "DEPARTMENT" : "Department"})
fig.update_layout(title_text='Total Price of Top 20 Categories', title_x=0.5, width=950, height=650)
fig.show()
Below chart shows the top 20 selling brands of all products.
top_20_category = eda_df.groupby(["BRAND"]).agg({"PRICE_RETAIL" : "mean"}).reset_index()
top_20_category = top_20_category.sort_values(by=["PRICE_RETAIL"], ascending=False)
fig = px.bar(top_20_category[:20], x='BRAND', y='PRICE_RETAIL', text_auto='.2s', labels={"BRAND" : "Brand", "PRICE_RETAIL" : "Price"})
fig.update_layout(title_text='Average Price of Top 20 Brands', title_x=0.5, width=950, height=650)
fig.show()
The Graph says that the price distribution is right skewed. It is not normally distributed so we need to convert it to normal distribution.
fig = px.histogram(eda_df, x="PRICE_RETAIL", color="DEPARTMENT", nbins=50, labels={"DEPARTMENT" : "Department", "PRICE_RETAIL" : "Price", "count" : "Count"})
fig.update_layout(title_text='Price Distribution by Department', title_x=0.5, width=950, height=650)
fig.show()
To convert the price into normal distribution we have applied a log transformation function.
logged_df = pd.DataFrame()
logged_df["DEPARTMENT"] = eda_df[["DEPARTMENT"]]
logged_df["LOG_PRICE"] = eda_df["PRICE_RETAIL"].apply(lambda x : np.log(x) if np.log(x) > 0 else 0)
fig = px.histogram(logged_df, x="LOG_PRICE", color="DEPARTMENT", nbins=50, labels={"count" : "Value Count", "LOG_PRICE" : "Log Price", "DEPARTMENT" : "Department"})
fig.update_layout(title_text='Logarithmic Price Distribution by Department', title_x=0.5, width=950, height=650)
fig.show()
featured_dataset = cleaned_dataset.copy()
featured_dataset
| SHIPPING_LOCATION | DEPARTMENT | CATEGORY | BREADCRUMBS | PRODUCT_NAME | BRAND | PRICE_RETAIL | PRICE_CURRENT | PRODUCT_SIZE | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Red Pepper Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 1 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Roasted Garlic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 2 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Classic Hummus, 10 Oz | Marketside | 2.67 | 2.67 | 10.0 |
| 3 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Marketside Everything Hummus, 10 oz | Marketside | 2.67 | 2.67 | 10.0 |
| 4 | 79936 | Deli | Hummus, Dips, & Salsa | Deli/Hummus, Dips, & Salsa | Price's Jalapeno Dip, 12 Oz. | Price's | 3.12 | 3.12 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Blueberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504970 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Peach Moscato 750 Ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504971 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Raspberry Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504972 | 70072 | Alcohol | Wine | Alcohol/Wine | Farm Fresh Mango Moscato 750ml | Farm Fresh Wine Company | 9.98 | 9.98 | 750.0 |
| 504973 | 70072 | Alcohol | Wine | Alcohol/Wine | Ole Orleans Heritage Riesling 750ml | Ole Orleans | 18.98 | 18.98 | 750.0 |
504974 rows × 9 columns
redudant_columns = ["SHIPPING_LOCATION", "BREADCRUMBS", "PRODUCT_NAME", "PRICE_CURRENT"]
featured_dataset.drop(redudant_columns, axis=1, inplace=True)
featured_dataset
| DEPARTMENT | CATEGORY | BRAND | PRICE_RETAIL | PRODUCT_SIZE | |
|---|---|---|---|---|---|
| 0 | Deli | Hummus, Dips, & Salsa | Marketside | 2.67 | 10.0 |
| 1 | Deli | Hummus, Dips, & Salsa | Marketside | 2.67 | 10.0 |
| 2 | Deli | Hummus, Dips, & Salsa | Marketside | 2.67 | 10.0 |
| 3 | Deli | Hummus, Dips, & Salsa | Marketside | 2.67 | 10.0 |
| 4 | Deli | Hummus, Dips, & Salsa | Price's | 3.12 | 12.0 |
| ... | ... | ... | ... | ... | ... |
| 504969 | Alcohol | Wine | Farm Fresh Wine Company | 9.98 | 750.0 |
| 504970 | Alcohol | Wine | Farm Fresh Wine Company | 9.98 | 750.0 |
| 504971 | Alcohol | Wine | Farm Fresh Wine Company | 9.98 | 750.0 |
| 504972 | Alcohol | Wine | Farm Fresh Wine Company | 9.98 | 750.0 |
| 504973 | Alcohol | Wine | Ole Orleans | 18.98 | 750.0 |
504974 rows × 5 columns
dummy_df = pd.get_dummies(featured_dataset[['DEPARTMENT','CATEGORY', 'BRAND']])
dummy_df
| DEPARTMENT_Alcohol | DEPARTMENT_Bakery & Bread | DEPARTMENT_Baking | DEPARTMENT_Beverages | DEPARTMENT_Breakfast & Cereal | DEPARTMENT_Candy | DEPARTMENT_Coffee | DEPARTMENT_Dairy & Eggs | DEPARTMENT_Deli | DEPARTMENT_Fresh Produce | ... | BRAND_natural bliss | BRAND_neuro drinks | BRAND_new | BRAND_noosa | BRAND_nutpods | BRAND_popchips | BRAND_sea_best | BRAND_simply... | BRAND_vibi+ | BRAND_vitaminwater | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 504970 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 504971 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 504972 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 504973 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
504974 rows × 4016 columns
pca = PCA(n_components=25) # Choose the number of components you want to retain
pca_result = pca.fit_transform(dummy_df)
pca_df = pd.DataFrame(data=pca_result)
pca_df
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | -0.419694 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 1 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | -0.419694 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 2 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | -0.419694 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 3 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | -0.419694 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 4 | -0.162840 | -0.188059 | -0.083084 | -0.134436 | -0.061704 | -0.007414 | 0.020106 | 0.262681 | 0.380736 | -0.480189 | ... | -0.003061 | -0.121335 | 0.006873 | 0.027117 | 0.004595 | 0.003790 | 0.004026 | -0.003033 | -0.003043 | 0.002084 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | -0.093395 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504970 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | -0.093395 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504971 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | -0.093395 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504972 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | -0.093395 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504973 | -0.149853 | -0.174627 | -0.091233 | -0.131267 | -0.057705 | -0.013895 | -0.000124 | 0.149477 | 0.084634 | -0.093337 | ... | -0.007454 | -0.214887 | -0.001550 | -0.053231 | 0.006210 | 0.032816 | 0.003407 | 0.010243 | -0.019743 | 0.067104 |
504974 rows × 25 columns
featured_dataset.drop(["DEPARTMENT", "CATEGORY", "BRAND"], axis=1, inplace=True)
featured_dataset = pd.concat([featured_dataset, pca_df], axis=1)
featured_dataset
| PRICE_RETAIL | PRODUCT_SIZE | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.67 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 1 | 2.67 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 2 | 2.67 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 3 | 2.67 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 4 | 3.12 | 12.0 | -0.162840 | -0.188059 | -0.083084 | -0.134436 | -0.061704 | -0.007414 | 0.020106 | 0.262681 | ... | -0.003061 | -0.121335 | 0.006873 | 0.027117 | 0.004595 | 0.003790 | 0.004026 | -0.003033 | -0.003043 | 0.002084 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | 9.98 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504970 | 9.98 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504971 | 9.98 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504972 | 9.98 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504973 | 18.98 | 750.0 | -0.149853 | -0.174627 | -0.091233 | -0.131267 | -0.057705 | -0.013895 | -0.000124 | 0.149477 | ... | -0.007454 | -0.214887 | -0.001550 | -0.053231 | 0.006210 | 0.032816 | 0.003407 | 0.010243 | -0.019743 | 0.067104 |
504974 rows × 27 columns
train_test_dataset = featured_dataset.copy()
X = train_test_dataset.drop(["PRICE_RETAIL"], axis=1)
y = train_test_dataset[["PRICE_RETAIL"]]
X
| PRODUCT_SIZE | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 1 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 2 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 3 | 10.0 | -0.189984 | -0.248196 | -0.118577 | -0.067075 | -0.066656 | -0.017657 | 0.015691 | 0.317993 | 0.514401 | ... | 0.004211 | 0.458439 | -0.008684 | -0.493213 | 0.026586 | 0.030831 | -0.056889 | 0.047118 | -0.076353 | 0.162100 |
| 4 | 12.0 | -0.162840 | -0.188059 | -0.083084 | -0.134436 | -0.061704 | -0.007414 | 0.020106 | 0.262681 | 0.380736 | ... | -0.003061 | -0.121335 | 0.006873 | 0.027117 | 0.004595 | 0.003790 | 0.004026 | -0.003033 | -0.003043 | 0.002084 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 504969 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504970 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504971 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504972 | 750.0 | -0.149875 | -0.174667 | -0.091255 | -0.131305 | -0.057724 | -0.013900 | -0.000124 | 0.149545 | 0.084681 | ... | -0.007462 | -0.215112 | -0.001552 | -0.053304 | 0.006218 | 0.032867 | 0.003413 | 0.010249 | -0.019786 | 0.067232 |
| 504973 | 750.0 | -0.149853 | -0.174627 | -0.091233 | -0.131267 | -0.057705 | -0.013895 | -0.000124 | 0.149477 | 0.084634 | ... | -0.007454 | -0.214887 | -0.001550 | -0.053231 | 0.006210 | 0.032816 | 0.003407 | 0.010243 | -0.019743 | 0.067104 |
504974 rows × 26 columns
y
| PRICE_RETAIL | |
|---|---|
| 0 | 2.67 |
| 1 | 2.67 |
| 2 | 2.67 |
| 3 | 2.67 |
| 4 | 3.12 |
| ... | ... |
| 504969 | 9.98 |
| 504970 | 9.98 |
| 504971 | 9.98 |
| 504972 | 9.98 |
| 504973 | 18.98 |
504974 rows × 1 columns
y = np.log(y)
y
| PRICE_RETAIL | |
|---|---|
| 0 | 0.982078 |
| 1 | 0.982078 |
| 2 | 0.982078 |
| 3 | 0.982078 |
| 4 | 1.137833 |
| ... | ... |
| 504969 | 2.300583 |
| 504970 | 2.300583 |
| 504971 | 2.300583 |
| 504972 | 2.300583 |
| 504973 | 2.943386 |
504974 rows × 1 columns
def base_models(X, y, test_size=0.2, random_state=42):
models = {
'Linear Regression': LinearRegression(),
'Lasso Regression': LassoCV(random_state=random_state),
'Decision Tree': RandomForestRegressor(random_state=random_state)
}
X[25]= pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
X.drop(['PRODUCT_SIZE'], axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
evaluation_results = {}
for model_name, model in models.items():
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_pred_train = model.predict(X_train)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
r2_t = r2_score(y_train, y_pred_train)
# Store evaluation results in a dictionary
evaluation_results[model_name] = {
'MSE': mse,
'MAE': mae,
'R2 test': r2,
'R2 train' : r2_t
}
return evaluation_results
base_models(X, y, test_size=0.2, random_state=42)
C:\Users\dmaka\anaconda3\envs\Tflow1\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:1568: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). C:\Users\dmaka\AppData\Local\Temp\ipykernel_7368\131706591.py:18: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
{'Linear Regression': {'MSE': 0.38256908001293427,
'MAE': 0.47927570750897064,
'RMSE': 0.618521689201708,
'R2 test': 0.214680528116167,
'R2 train': 0.21564240024117065},
'Lasso Regression': {'MSE': 0.38259045806527975,
'MAE': 0.4791587849682495,
'RMSE': 0.6185389705307821,
'R2 test': 0.21463664427490803,
'R2 train': 0.21553497121456755},
'Decision Tree': {'MSE': 0.046740671160948113,
'MAE': 0.09572179969210073,
'RMSE': 0.21619590921418497,
'R2 test': 0.9040529904027511,
'R2 train': 0.9139985335752849}}
As Linear and Lasso regression are not able to handle Non-linearity and not robust to outliers.
In comparison to the Linear Regression and Lasso Regression models, the Decision Tree model performs noticeably better in terms of MSE and MAE. The Decision Tree's lower MSE and MAE scores reflect that it generates more precise predictions and more closely fits the real values.
The MSE and MAE values for Lasso Regression and Linear Regression are higher, indicating that these models would be less successful than the Decision Tree at identifying the underlying patterns in the data.
def evaluate_regression_models(X_, y, test_size=0.2, random_state=42):
# Initialize the models
models = {
'CatBoost': CatBoostRegressor(random_state=random_state, silent=True),
'XGBoost': XGBRegressor(random_state=random_state, verbosity=0),
'RandomForest': RandomForestRegressor(random_state=random_state)
}
Mscaler = MinMaxScaler(feature_range=(0,1))
X_[25]= pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
X_ = X.drop(['PRODUCT_SIZE'], axis=1)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=test_size, random_state=random_state)
evaluation_results = {}
for model_name, model in models.items():
# Train the model
model.fit(X_train, y_train)
# model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_pred_train = model.predict(X_train)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
r2_t = r2_score(y_train, y_pred_train)
# Store evaluation results in a dictionary
evaluation_results[model_name] = {
'MSE': mse,
'MAE': mae,
'R2 test': r2,
'R2 train' : r2_t
}
return evaluation_results
evaluate_regression_models(X, y, test_size=0.2, random_state=42)
C:\Users\dmaka\AppData\Local\Temp\ipykernel_7368\2068173715.py:23: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
{'CatBoost': {'MSE': 0.1257727503994698,
'MAE': 0.2530568934650479,
'R2 test': 0.7418197259492341,
'R2 train': 0.7473200461957551},
'XGBoost': {'MSE': 0.12138713590923876,
'MAE': 0.24845567036839641,
'R2 test': 0.7508223051833893,
'R2 train': 0.7560006893486932},
'RandomForest': {'MSE': 0.046740671160948113,
'MAE': 0.09572179969210073,
'R2 test': 0.9040529904027511,
'R2 train': 0.9139985335752849}}
def hyperparameter_tuning_RandomForest(X, y, test_size=0.2, random_state=42):
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
# Define parameter grid for RandomizedSearchCV
param_grid = {
'n_estimators': [300, 500, 800, 1000],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
Mscaler = MinMaxScaler(feature_range=(0,1))
X[25]= pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
X_ = X.drop(['PRODUCT_SIZE'], axis=1)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=test_size, random_state=random_state)
# Create a RandomForestRegressor
RF = RandomForestRegressor(random_state=random_state, n_jobs=-1)
# Perform RandomizedSearchCV
rand_search = RandomizedSearchCV(estimator=RF, param_distributions=param_grid, cv=2, n_jobs=-1, verbose=2)
# Fit the model
rand_search.fit(X_train, y_train)
# Print best parameters and score
print("Best Parameters:", rand_search.best_params_)
print("Best Score:", rand_search.best_score_)
# Evaluate on test set
test_score = rand_search.score(X_test, y_test)
print("Test Set Score:", test_score)
# Example usage
# hyperparameter_tuning_RandomForest(X, y)
hyperparameter_tuning_RandomForest(X, y, test_size=0.2, random_state=42)
Fitting 2 folds for each of 10 candidates, totalling 20 fits
C:\Users\dmaka\anaconda3\envs\Tflow1\lib\site-packages\sklearn\model_selection\_search.py:909: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
Best Parameters: {'n_estimators': 800, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 30}
Best Score: 0.8999592139431523
Test Set Score: 0.9035918918598134
def hyperparameter_tuning_XGBoost(X, y, test_size=0.2, random_state=42):
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
# Define parameter grid for GridSearchCV
param_grid = {
'n_estimators': [300, 600, 800],
'max_depth': [3, 5, 10],
'learning_rate': [0.01, 0.03 ,0.05, 0.1],
'colsample_bytree': [0.8, 0.9, 1.0],
'gamma': [0, 1, 2]
}
Mscaler = MinMaxScaler(feature_range=(0,1))
X[25]= pd.DataFrame(Mscaler.fit_transform(X[['PRODUCT_SIZE']]))
X_ = X.drop(['PRODUCT_SIZE'], axis=1)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=test_size, random_state=random_state)
# Create a RandomForestClassifier
XG = XGBRegressor(random_state=random_state, n_jobs=-1)
# Perform GridSearchCV
rand_search = RandomizedSearchCV(estimator=XG, param_distributions=param_grid, cv=2, n_jobs=-1, verbose=2)
# Fit the model
rand_search.fit(X_train, y_train)
# Print best parameters and score
print("Best Parameters:", rand_search.best_params_)
print("Best Score:", rand_search.best_score_)
# Evaluate on test set
test_score = rand_search.score(X_test, y_test)
print("Test Set Score:", test_score)
hyperparameter_tuning_XGBoost(X, y, test_size=0.2, random_state=42)
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Best Parameters: {'n_estimators': 300, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.9}
Best Score: 0.8618847570379943
Test Set Score: 0.8634410050433886
With these features, XG Boost model performs the best with training and testing accuracy of 86 %.
Although the Randomized Search CV was applied, Random Forest wasn’t able to give better accuracy than it already did in the first run.
Out of all the models Decision Tree and Random Forest Regressor are giving promising results on both training and testing data.